Method and Apparatus for Unpacking Packed Data

ABSTRACT

An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element.

RELATED APPLICATIONS

The present patent application is a continuation of U.S. patent application Ser. No. 10/185,896, filed on Jun. 27, 2002, now pending, which is a divisional of U.S. patent application Ser. No. 09/657,447, filed on Sep. 8, 2000, now U.S. Pat. No. 6,516,406, which is a continuation of U.S. patent application Ser. No. 08/974,435, filed on Nov. 20, 1997, now U.S. Pat. No. 6,119,216, which is a divisional of U.S. patent application Ser. No. 08/791,003, filed on Jan. 27, 1997, now U.S. Pat. No. 5,802,336, which is a continuation of U.S. patent application Ser. No. 08/349,047, filed on Dec. 2, 1994, now abandoned. The following U.S. patent application Ser. Nos. are hereby incorporated herein by reference: 10/185,896; 09/657,447; 08/974,435; 08/791,003; and 08/349,047.

FIELD OF DISCLOSURE

The present invention includes an apparatus and method of performing operations using a single control signal to manipulate multiple data elements. The present invention allows execution of move, pack and unpack operations on packed data types.

BACKGROUND OF THE DISCLOSURE

Today, most personal computer systems operate with one instruction to produce one result. Performance increases are achieved by increasing execution speed of instructions and the processor instruction complexity, and by performing multiple instructions in parallel, an approach known as Complex Instruction Set Computer (CISC). Such processors as the Intel 80386™ microprocessor, available from Intel Corp. of Santa Clara, Calif., belong to the CISC category of processor.

Previous computer system architecture has been optimized to take advantage of the CISC concept. Such systems typically have data buses thirty-two bits wide. However, applications targeted at computer supported cooperation (CSC: the integration of teleconferencing with mixed media data manipulation), 2D/3D graphics, image processing, video compression/decompression, recognition algorithms and audio manipulation increase the need for improved performance. But increasing the execution speed and complexity of instructions is only one solution.

One common aspect of these applications is that they often manipulate large amounts of data where only a few bits are important. That is, the relevant bits of the data are represented in far fewer bits than the size of the data bus. For example, processors execute many operations on eight bit and sixteen bit data (e.g., pixel color components in a video image) but have much wider data buses and registers. Thus, a processor having a thirty-two bit data bus and registers, and executing one of these algorithms, can waste up to seventy-five percent of its data processing, carrying and storage capacity because only the first eight bits of data are important.

As such, what is desired is a processor that increases performance by more efficiently using the difference between the number of bits required to represent the data to be manipulated and the actual data carrying and storage capacity of the processor.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example, and not limitation, in the figures. Like references indicate similar elements.

FIG. 1 illustrates an embodiment of the computer system using the methods and apparatus of the present invention.

FIG. 2 illustrates an embodiment of the processor of the present invention.

FIG. 3 is a flow diagram illustrating the general steps used by the processor to manipulate data in the register file.

FIG. 4 a illustrates memory data types.

FIG. 4 b, FIG. 4 c and FIG. 4 d illustrate in-register integer data representations.

FIG. 5 a illustrates packed data types.

FIG. 5 b, FIG. 5 c and FIG. 5 d illustrate in-register packed data representations.

FIG. 6 a illustrates a control signal format used in the computer system to indicate the use of packed data.

FIG. 6 b illustrates a second control signal format that can be used in the computer system to indicate the use of packed data or integer data.

FIG. 7 illustrates one embodiment of a method followed by a processor when performing a pack operation on packed data.

FIG. 8 a illustrates a circuit capable of implementing a pack operation on packed byte data.

FIG. 8 b illustrates a circuit capable of implementing a pack operation on packed word data.

FIG. 9 illustrates one embodiment of a method followed by a processor when performing an unpack operation on packed data.

FIG. 10 illustrates a circuit capable of implementing an unpack operation on packed data.

DETAILED DESCRIPTION

A processor having move, pack, and unpack operations that operate on multiple data elements is described. In the following description, numerous specific details are set forth such as circuits, etc., in order to provide a thorough understanding of the present invention. In other instances, well-known structures and techniques have not been shown in detail in order not to unnecessarily obscure the present invention.

DEFINITIONS

To provide a foundation for understanding the description of the embodiments of the present invention, the following definitions are provided.

Bit X through Bit Y: defines a subfield of a binary number. For example, bit six through bit zero of the byte 00111010₂ (shown in base two) represent the subfield 111010₂. The ‘2’ following a binary number indicates base 2. Therefore, 1000₂ equals 8₁₀, while F₁₆ equals 15₁₀.

Rₓ: is a register. A register is any device capable of storing and providing data. Further functionality of a register is described below. A register is not necessarily part of the processor's package.

DEST: is a data address.

SRC1: is a data address.

SRC2: is a data address.

Result: is the data to be stored in the register addressed by DEST.

Source1: is the data stored in the register addressed by SRC1.

Source2: is the data stored in the register addressed by SRC2.

Computer System

Referring to FIG. 1, a computer system upon which an embodiment of the present invention can be implemented is shown as computer system 100. Computer system 100 comprises a bus 101, or other communications hardware and software, for communicating information, and a processor 109 coupled with bus 101 for processing information. Computer system 100 further comprises a random access memory (RAM) or other dynamic storage device (referred to as main memory 104), coupled to bus 101 for storing information and instructions to be executed by processor 109. Main memory 104 also may be used for storing temporary variables or other intermediate information during execution of instructions by processor 109. Computer system 100 also comprises a read only memory (ROM) 106, and/or other static storage device, coupled to bus 101 for storing static information and instructions for processor 109. Data storage device 107 is coupled to bus 101 for storing information and instructions.

Furthermore, a data storage device 107, such as a magnetic disk or optical disk, and its corresponding disk drive, can be coupled to computer system 100. Computer system 100 can also be coupled via bus 101 to a display device 121 for displaying information to a computer user. Display device 121 can include a frame buffer, specialized graphics rendering devices, a cathode ray tube (CRT), and/or a flat panel display. An alphanumeric input device 122, including alphanumeric and other keys, is typically coupled to bus 101 for communicating information and command selections to processor 109. Another type of user input device is cursor control 123, such as a mouse, a trackball, a pen, a touch screen, or cursor direction keys for communicating direction information and command selections to processor 109, and for controlling cursor movement on display device 121. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), which allows the device to specify positions in a plane. However, this invention should not be limited to input devices with only two degrees of freedom.

Another device which may be coupled to bus 101 is a hard copy device 124 which may be used for printing instructions, data, or other information on a medium such as paper, film, or similar types of media. Additionally, computer system 100 can be coupled to a device for sound recording, and/or playback 125, such as an audio digitizer coupled to a microphone for recording information. Further, the device may include a speaker which is coupled to a digital to analog (D/A) converter for playing back the digitized sounds.

Also, computer system 100 can be a terminal in a computer network (e.g., a LAN). Computer system 100 would then be a computer subsystem of a computer system including a number of networked devices. Computer system 100 optionally includes video digitizing device 126. Video digitizing device 126 can be used to capture video images that can be transmitted to others on the computer network.

Computer system 100 is useful for supporting computer supported cooperation (CSC: the integration of teleconferencing with mixed media data manipulation), 2D/3D graphics, image processing, video compression/decompression, recognition algorithms and audio manipulation.

Processor

FIG. 2 illustrates a detailed diagram of processor 109. Processor 109 can be implemented on one or more substrates using any of a number of process technologies, such as BiCMOS, CMOS, and NMOS. Processor 109 comprises a decoder 202 for decoding control signals and data used by processor 109. Data can then be stored in register file 204 via internal bus 205. As a matter of clarity, the registers of an embodiment should not be limited in meaning to a particular type of circuit. Rather, a register of an embodiment need only be capable of storing and providing data, and performing the functions described herein.

Depending on the type of data, the data may be stored in integer registers 201, registers 209, status registers 208, or instruction pointer register 211. Other registers can be included in the register file 204, for example, floating point registers. In one embodiment, integer registers 201 store thirty-two bit integer data. In one embodiment, registers 209 contains eight registers, R₀ 212 a through R₇ 212 h. Each register in registers 209 is sixty-four bits in length. R₀ 212 a, R₁ 212 b and R₂ 212 c are examples of individual registers in registers 209. Thirty-two bits of a register in registers 209 can be moved into an integer register in integer registers 201. Similarly, a value in an integer register can be moved into thirty-two bits of a register in registers 209.

Status registers 208 indicate the status of processor 109. Instruction pointer register 211 stores the address of the next instruction to be executed. Integer registers 201, registers 209, status registers 208, and instruction pointer register 211 all connect to internal bus 205. Any additional registers would also connect to the internal bus 205.

In another embodiment, some of these registers can be used for two different types of data. For example, registers 209 and integer registers 201 can be combined where each register can store either integer data or packed data. In another embodiment, registers 209 can be used as floating point registers. In this embodiment, either packed data or floating point data can be stored in registers 209. In one embodiment, the combined registers are sixty-four bits in length and integers are represented as sixty-four bits. In this embodiment, in storing packed data and integer data, the registers do not need to differentiate between the two data types.

Functional unit 203 performs the operations carried out by processor 109. Such operations may include shifts, addition, subtraction and multiplication, etc. Functional unit 203 connects to internal bus 205. Cache 206 is an optional element of processor 109 and can be used to cache data and/or control signals from, for example, main memory 104. Cache 206 is connected to decoder 202, and is connected to receive control signal 207.

FIG. 3 illustrates the general operation of processor 109. That is, FIG. 3 illustrates the steps followed by processor 109 while performing an operation on packed data, performing an operation on unpacked data, or performing some other operation. For example, such operations include a load operation to load a register in register file 204 with data from cache 206, main memory 104, read only memory (ROM) 106, or data storage device 107. In one embodiment of the present invention, processor 109 supports most of the instructions supported by the Intel 80486™, available from Intel Corporation of Santa Clara, Calif. In another embodiment of the present invention, processor 109 supports all the operations supported by the Intel 80486™, available from Intel Corporation of Santa Clara, Calif. In another embodiment of the present invention, processor 109 supports all the operations supported by the Pentium™ processor, the Intel 80486™ processor, the 80386™ processor, the Intel 80286™ processor, and the Intel 8086™ processor, all available from Intel Corporation of Santa Clara, Calif. In another embodiment of the present invention, processor 109 supports all the operations supported in the IA™ (Intel Architecture), as defined by Intel Corporation of Santa Clara, Calif. (see Microprocessors, Intel Data Books volume 1 and volume 2, 1992 and 1993, available from Intel of Santa Clara, Calif.). Generally, processor 109 can support the present instruction set for the Pentium™ processor, but can also be modified to incorporate future instructions, as well as those described herein. What is important is that processor 109 can support previously used operations in addition to the operations described herein.

At step 301, the decoder 202 receives a control signal 207 from either the cache 206 or bus 101. Decoder 202 decodes the control signal to determine the operations to be performed.

Decoder 202 accesses the register file 204, or a location in memory, at step 302. Registers in the register file 204, or memory locations in the memory, are accessed depending on the register address specified in the control signal 207. For example, for an operation on packed data, control signal 207 can include SRC1, SRC2 and DEST register addresses. SRC1 is the address of the first source register. SRC2 is the address of the second source register. In some cases, the SRC2 address is optional as not all operations require two source addresses. If the SRC2 address is not required for an operation, then only the SRC1 address is used. DEST is the address of the destination register where the result data is stored. In one embodiment, SRC1 or SRC2 is also used as DEST. SRC1, SRC2 and DEST are described more fully in relation to FIG. 6 a and FIG. 6 b. The data stored in the corresponding registers is referred to as Source1, Source2, and Result respectively. Each of these data is sixty-four bits in length.

In another embodiment of the present invention, any one, or all, of SRC1, SRC2 and DEST, can define a memory location in the addressable memory space of processor 109. For example, SRC1 may identify a memory location in main memory 104 while SRC2 identifies a first register in integer registers 201, and DEST identifies a second register in registers 209. For simplicity of the description herein, references are made to the accesses to the register file 204; however, these accesses could be made to memory instead.

In another embodiment of the present invention, the operation code only includes two addresses, SRC1 and SRC2. In this embodiment, the result of the operation is stored in the SRC1 or SRC2 register. That is, SRC1 (or SRC2) is used as the DEST. This type of addressing is compatible with previous CISC instructions having only two addresses. This reduces the complexity in the decoder 202. Note, in this embodiment, if the data contained in the SRC1 register is not to be destroyed, then that data must first be copied into another register before the execution of the operation. The copying would require an additional instruction. To simplify the description herein, the three address addressing scheme will be described (i.e. SRC1, SRC2, and DEST). However, it should be remembered that the control signal, in one embodiment, may only include SRC1 and SRC2, and that SRC1 (or SRC2) identifies the destination register.

Where the control signal requires an operation, at step 303, functional unit 203 will be enabled to perform this operation on accessed data from register file 204. Once the operation has been performed in functional unit 203, at step 304, the result is stored back into register file 204 according to requirements of control signal 207.

Data and Storage Formats

FIG. 4 a illustrates some of the data formats as may be used in the computer system of FIG. 1. These data formats are fixed point. Processor 109 can manipulate these data formats. Multimedia algorithms often use these data formats. A byte 401 contains eight bits of information. A word 402 contains sixteen bits of information, or two bytes. A doubleword 403 contains thirty-two bits of information, or four bytes. Thus, processor 109 executes control signals that may operate on any one of these memory data formats.

In the following description, references to bit, byte, word, and doubleword subfields are made. For example, bit six through bit zero of the byte 00111010₂ (shown in base 2) represent the subfield 111010₂.
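The subfield notation can be illustrated with a short Python sketch. The helper name `subfield` is hypothetical and is not part of the described embodiments; bit zero is taken to be the least significant bit, as in the example above.

```python
def subfield(value: int, high: int, low: int) -> int:
    """Return bit `high` through bit `low` (inclusive) of `value`.

    Bit zero is the least significant bit. The function name and
    signature are illustrative assumptions only.
    """
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


byte = 0b00111010
# Bit six through bit zero of 00111010 (base two) is 111010 (base two).
print(bin(subfield(byte, 6, 0)))
```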

FIG. 4 b through FIG. 4 d illustrate in-register representations used in one embodiment of the present invention. For example, unsigned byte in-register representation 410 can represent data stored in a register in integer registers 201. In one embodiment, a register, in integer registers 201, is sixty-four bits in length. In another embodiment, a register, in integer registers 201, is thirty-two bits in length. For the simplicity of the description, the following describes sixty-four bit integer registers; however, thirty-two bit integer registers can be used.

Unsigned byte in-register representation 410 illustrates processor 109 storing a byte 401 in integer registers 201; the first eight bits, bit seven through bit zero, in that register are dedicated to the data byte 401. These bits are shown as {b}. To properly represent this byte, the remaining 56 bits must be zero. For a signed byte in-register representation 411, integer registers 201 store the data in the first seven bits, bit six through bit zero. Bit seven represents the sign bit, shown as an {s}. The remaining bit sixty-three through bit eight are the continuation of the sign for the byte.

Unsigned word in-register representation 412 is stored in one register of integer registers 201. Bit fifteen through bit zero contain an unsigned word 402. These bits are shown as {w}. To properly represent this word, the remaining bit sixty-three through bit sixteen must be zero. A signed word 402 is stored in bit fourteen through bit zero as shown in the signed word in-register representation 413. The remaining bit sixty-three through bit fifteen is the sign field.

A doubleword 403 can be stored as an unsigned doubleword in-register representation 414 or a signed doubleword in-register representation 415. Bit thirty-one through bit zero of an unsigned doubleword in-register representation 414 are the data. These bits are shown as {d}. To properly represent this unsigned doubleword, the remaining bit sixty-three through bit thirty-two must be zero. Integer registers 201 stores a signed doubleword in-register representation 415 in its bit thirty through bit zero; the remaining bit sixty-three through bit thirty-one are the sign field.

As indicated by the above FIG. 4 b through FIG. 4 d, storage of some data types in a sixty-four bit wide register is an inefficient method of storage. For example, for storage of an unsigned byte in-register representation 410, bit sixty-three through bit eight must be zero, while only bit seven through bit zero may contain non-zero bits. Thus, a processor storing a byte in a sixty-four bit register uses only 12.5% of the register's capacity. Similarly, only the first few bits of operations performed by functional unit 203 will be important.
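The in-register representations of FIG. 4 b through FIG. 4 d can be modeled with a minimal Python sketch, assuming a sixty-four bit integer register. The names `in_register` and `MASK64` are hypothetical. Unsigned values are zero extended, and signed values carry the sign through the upper bits in two's complement form.

```python
REGISTER_BITS = 64
MASK64 = (1 << REGISTER_BITS) - 1


def in_register(value: int, width: int, signed: bool) -> int:
    """Return the 64-bit register image of a `width`-bit data value.

    Illustrative model only: unsigned data leaves the upper bits zero,
    signed data repeats the sign bit through bit sixty-three.
    """
    if signed:
        low, high = -(1 << (width - 1)), (1 << (width - 1)) - 1
    else:
        low, high = 0, (1 << width) - 1
    if not low <= value <= high:
        raise ValueError("value does not fit in the data format")
    # Masking a negative Python int yields its two's complement image.
    return value & MASK64


print(f"{in_register(0x3A, 8, signed=False):016X}")  # upper 56 bits are zero
print(f"{in_register(-6, 8, signed=True):016X}")     # upper 56 bits repeat the sign
print(f"{8 / REGISTER_BITS:.1%} of the register holds data for a byte")
```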

FIG. 5 a illustrates the data formats for packed data. Each packed data includes more than one independent data element. Three packed data formats are illustrated: packed byte 501, packed word 502, and packed doubleword 503. Packed byte, in one embodiment of the present invention, is sixty-four bits long containing eight data elements. Each data element is one byte long. Generally, a data element is an individual piece of data that is stored in a single register (or memory location) with other data elements of the same length. In one embodiment of the present invention, the number of data elements stored in a register is sixty-four bits divided by the length in bits of a data element.

Packed word 502 is sixty-four bits long and contains four word 402 data elements. Each word 402 data element contains sixteen bits of information.

Packed doubleword 503 is sixty-four bits long and contains two doubleword 403 data elements. Each doubleword 403 data element contains thirty-two bits of information.

FIG. 5 b through FIG. 5 d illustrate the in-register packed data storage representation. Unsigned packed byte in-register representation 510 illustrates the storage of packed byte 501 in one of the registers R₀ 212 a through Rₙ 212 af. Information for each byte data element is stored in bit seven through bit zero for byte zero, bit fifteen through bit eight for byte one, bit twenty-three through bit sixteen for byte two, bit thirty-one through bit twenty-four for byte three, bit thirty-nine through bit thirty-two for byte four, bit forty-seven through bit forty for byte five, bit fifty-five through bit forty-eight for byte six and bit sixty-three through bit fifty-six for byte seven. Thus, all available bits are used in the register. This storage arrangement increases the storage efficiency of the processor. As well, with eight data elements accessed, one operation can now be performed on eight data elements simultaneously. Signed packed byte in-register representation 511 is similarly stored in a register in registers 209. Note that only the eighth bit of every byte data element is the necessary sign bit; other bits may or may not be used to indicate sign.

Unsigned packed word in-register representation 512 illustrates how word three through word zero are stored in one register of registers 209. Bit fifteen through bit zero contain the data element information for word zero, bit thirty-one through bit sixteen contain the information for data element word one, bit forty-seven through bit thirty-two contain the information for data element word two and bit sixty-three through bit forty-eight contain the information for data element word three. Signed packed word in-register representation 513 is similar to the unsigned packed word in-register representation 512. Note that only the sixteenth bit of each word data element contains the necessary sign indicator.

Unsigned packed doubleword in-register representation 514 shows how registers 209 store two doubleword data elements. Doubleword zero is stored in bit thirty-one through bit zero of the register. Doubleword one is stored in bit sixty-three through bit thirty-two of the register. Signed packed doubleword in-register representation 515 is similar to unsigned packed doubleword in-register representation 514. Note that the necessary sign bit is the thirty-second bit of the doubleword data element.
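The packed layouts described above (element zero in the lowest bits, with sixty-four divided by the element length giving the element count) can be sketched in Python as follows. The helpers `join_elements` and `split_elements` are hypothetical names chosen for illustration; they model only the storage layout and are not the pack or unpack operations of the embodiments.

```python
def join_elements(elements, width):
    """Place `width`-bit elements into a 64-bit value, element zero lowest."""
    count = 64 // width
    if len(elements) != count:
        raise ValueError(f"expected {count} elements of {width} bits")
    mask = (1 << width) - 1
    register = 0
    for index, element in enumerate(elements):
        register |= (element & mask) << (index * width)
    return register


def split_elements(register, width):
    """Return the list of `width`-bit elements held in a 64-bit value."""
    mask = (1 << width) - 1
    return [(register >> (index * width)) & mask for index in range(64 // width)]


packed_byte = join_elements([0, 1, 2, 3, 4, 5, 6, 7], 8)
print(f"{packed_byte:016X}")            # byte seven is in bits 63..56
print(split_elements(packed_byte, 16))  # same bits viewed as four words
```

Viewing the same sixty-four bits as bytes, words, or doublewords changes only the element width used to index the register, which is why the register itself need not record the format.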

As mentioned previously, registers 209 may be used for both packed data and integer data. In this embodiment of the present invention, the individual programming processor 109 may be required to track whether an addressed register, R₀ 212 a for example, is storing packed data or simple integer/fixed point data. In an alternative embodiment, processor 109 could track the type of data stored in individual registers of registers 209. This alternative embodiment could then generate errors if, for example, a packed addition operation were attempted on simple integer/fixed point data.

Control Signal Formats

The following describes one embodiment of control signal formats used by processor 109 to manipulate packed data. In one embodiment of the present invention, control signals are represented as thirty-two bits. Decoder 202 may receive control signal 207 from bus 101. In another embodiment, decoder 202 can also receive such control signals from cache 206.

FIG. 6 a illustrates a general format for a control signal operating on packed data. Operation field OP 601, bit thirty-one through bit twenty-six, provides information about the operation to be performed by processor 109; for example, packed addition, packed subtraction, etc. SRC1 602, bit twenty-five through bit twenty, provides the source register address of a register in registers 209. This source register contains the first packed data, Source1, to be used in the execution of the control signal. Similarly, SRC2 603, bit nineteen through bit fourteen, contains the address of a register in registers 209. This second source register contains the packed data, Source2, to be used during execution of the operation. DEST 605, bit five through bit zero, contains the address of a register in registers 209. This destination register will store the result packed data, Result, of the packed data operation.

Control bits SZ 610, bit twelve and bit thirteen, indicate the length of the data elements in the first and second packed data source registers. If SZ 610 equals 01₂, then the packed data is formatted as packed byte 501. If SZ 610 equals 10₂, then the packed data is formatted as packed word 502. SZ 610 equaling 00₂ or 11₂ is reserved; however, in another embodiment, one of these values could be used to indicate packed doubleword 503.

Control bit T 611, bit eleven, indicates whether the operation is to be carried out with saturate mode. If T 611 equals one, then a saturating operation is performed. If T 611 equals zero, then a nonsaturating operation is performed. Saturating operations will be described later.

Control bit S 612, bit ten, indicates the use of a signed operation. If S 612 equals one, then a signed operation is performed. If S 612 equals zero, then an unsigned operation is performed.
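The field positions given for the FIG. 6 a format can be exercised with a short Python sketch that splits a thirty-two bit control signal into its fields. The function name `decode_control_signal`, the dictionary keys, and the operation code value used in the example are assumptions made for illustration; the text does not assign specific OP encodings, and bits nine through six are not described, so they are ignored here.

```python
def decode_control_signal(signal: int) -> dict:
    """Split a 32-bit FIG. 6 a style control signal into named fields."""

    def bits(high: int, low: int) -> int:
        return (signal >> low) & ((1 << (high - low + 1)) - 1)

    element_formats = {0b01: "packed byte", 0b10: "packed word"}
    return {
        "op": bits(31, 26),        # OP 601
        "src1": bits(25, 20),      # SRC1 602
        "src2": bits(19, 14),      # SRC2 603
        "format": element_formats.get(bits(13, 12), "reserved"),  # SZ 610
        "saturate": bits(11, 11) == 1,   # T 611
        "signed": bits(10, 10) == 1,     # S 612
        "dest": bits(5, 0),        # DEST 605
    }


# Hypothetical signal: OP value 3, SRC1 = R1, SRC2 = R2, packed byte,
# saturating, unsigned, DEST = R3.
example = (3 << 26) | (1 << 20) | (2 << 14) | (0b01 << 12) | (1 << 11) | 3
print(decode_control_signal(example))
```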

FIG. 6 b illustrates a second general format for a control signal operating on packed data. This format corresponds with the general integer opcode format described in the “Pentium™ Processor Family User's Manual,” available from Intel Corporation, Literature Sales, P.O. Box 7641, Mt. Prospect, Ill., 60056-7641. Note that OP 601, SZ 610, T 611, and S 612 are all combined into one large field. For some control signals, bits three through five are SRC1 602. In one embodiment, where there is a SRC1 602 address, then bits three through five also correspond to DEST 605. In an alternate embodiment, where there is a SRC2 603 address, then bits zero through two also correspond to DEST 605. For other control signals, like a packed shift immediate operation, bits three through five represent an extension to the opcode field. In one embodiment, this extension allows a programmer to include an immediate value with the control signal, such as a shift count value. In one embodiment, the immediate value follows the control signal. This is described in more detail in the “Pentium™ Processor Family User's Manual,” in appendix F, pages F-1 through F-3. Bits zero through two represent SRC2 603. This general format allows register to register, memory to register, register by memory, register by register, register by immediate, register to memory addressing. Also, in one embodiment, this general format can support integer register to register, and register to integer register addressing.

Description of Saturate/Unsaturate

As mentioned previously, T 611 indicates whether operations optionally saturate. Where the result of an operation, with saturate enabled, overflows or underflows the range of the data, the result will be clamped. Clamping means setting the result to a maximum or minimum value should a result exceed the range's maximum or minimum value. In the case of underflow, saturation clamps the result to the lowest value in the range and in the case of overflow, to the highest value. The allowable range for each data format is shown in Table 1.

TABLE 1

Data Format            Minimum Value    Maximum Value
Unsigned Byte          0                255
Signed Byte            −128             127
Unsigned Word          0                65535
Signed Word            −32768           32767
Unsigned Doubleword    0                2³² − 1
Signed Doubleword      −2³¹             2³¹ − 1

As mentioned above, T 611 indicates whether saturating operations are being performed. Therefore, using the unsigned byte data format, if an operation's result=258 and saturation was enabled, then the result would be clamped to 255 before being stored into the operation's destination register. Similarly, if an operation's result=−32999 and processor 109 used signed word data format with saturation enabled, then the result would be clamped to −32768 before being stored into the operation's destination register.
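The clamping behavior can be sketched in Python using the byte and word ranges of Table 1. The names `RANGES` and `saturate` are hypothetical, and the sketch models only the clamping of a single result value, not a complete packed operation.

```python
# (format, signed) -> (minimum, maximum), taken from the byte and word
# rows of Table 1.
RANGES = {
    ("byte", False): (0, 255),
    ("byte", True): (-128, 127),
    ("word", False): (0, 65535),
    ("word", True): (-32768, 32767),
}


def saturate(result: int, data_format: str, signed: bool) -> int:
    """Clamp `result` to the allowable range of the given data format."""
    minimum, maximum = RANGES[(data_format, signed)]
    return max(minimum, min(maximum, result))


print(saturate(258, "byte", signed=False))    # overflow clamps to 255
print(saturate(-32999, "word", signed=True))  # underflow clamps to -32768
```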

Data Manipulation Operations

In one embodiment of the present invention, the performance of multimedia applications is improved not only by supporting a standard CISC instruction set (unpacked data operations), but also by supporting operations on packed data. Such packed data operations can include an addition, a subtraction, a multiplication, a compare, a shift, an AND, and an XOR. However, to take full advantage of these operations, it has been determined that data manipulation operations should be included. Such data manipulation operations can include a move, a pack, and an unpack. Move, pack and unpack facilitate the execution of the other operations by generating packed data in formats that allow for easier use by programmers.

For further background on the other packed operations, see “A Microprocessor Having a Compare Operation,” filed on Dec. 21, 1994, Ser. No. 349,040, now abandoned, “A Microprocessor Having a Multiply Operation,” filed on Dec. 1, 1994, Ser. No. 349,559, now abandoned, “A Novel Processor Having Shift Operations,” filed on Dec. 1, 1994, Ser. No. 349,730, now abandoned, “A Method and Apparatus Using Packed Data in a Processor,” filed on Dec. 30, 1993, Ser. No. 08/176,123, now abandoned, and “A Method and Apparatus Using Novel Operations in a Processor,” filed on Dec. 30, 1993, Ser. No. 08/175,772, now abandoned, all assigned to the assignee of the present invention.

Move Operation

The move operation transfers data to or from registers 209. In one embodiment, SRC2 603 is the address containing the source data and DEST 605 is the address where the data is to be transferred. In this embodiment, SRC1 602 would not be used. In another embodiment, SRC1 602 is DEST 605.

For the purposes of the explanation of the move operation, a distinction is drawn between a register and a memory location. Registers are found in register file 204 while memory can be, for example, in cache 206, main memory 104, ROM 106, or data storage device 107.

The move operation can move data from memory to registers 209, from registers 209 to memory, and from a register in registers 209 to a second register in registers 209. In one embodiment, packed data is stored in different registers than those used to store integer data. In this embodiment, the move operation can move data from integer registers 201 to registers 209. For example, in processor 109, if packed data is stored in registers 209 and integer data is stored in integer registers 201, then a move instruction can be used to move data from integer registers 201 to registers 209, and vice versa.

In one embodiment, when a memory address is indicated for the move, the eight bytes of data at the memory location (the memory location indicating the least significant byte) are loaded to a register in registers 209 or stored from that register. When a register in registers 209 is indicated, the contents of that register are moved to or loaded from a second register in registers 209. If the integer registers 201 are sixty-four bits in length, and an integer register is specified, then the eight bytes of data in that integer register are loaded to a register in registers 209 or stored from that register.

In one embodiment, integers are represented as thirty-two bits. When a move operation is performed from registers 209 to integer registers 201, then only the low thirty-two bits of the packed data are moved to the specified integer register. In one embodiment, the high order thirty-two bits are zeroed. Similarly, only the low thirty-two bits of a register in registers 209 are loaded when a move is executed from integer registers 201 to registers 209. In one embodiment, processor 109 supports a thirty-two bit move operation between a register in registers 209 and memory. In another embodiment, a move of only thirty-two bits is performed on the high order thirty-two bits of packed data.
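The thirty-two bit moves described above can be sketched in software as follows. This is an illustrative model only, not the disclosed hardware: the function names are hypothetical, register contents are modeled as Python integers, and the zeroing of the high order thirty-two bits is assumed here to apply to the sixty-four bit packed register that receives the integer.

```python
MASK32 = 0xFFFFFFFF  # selects the low thirty-two bits


def move_packed_to_int32(packed: int) -> int:
    """Model a move from a 64-bit packed register to a 32-bit integer register.

    Only the low thirty-two bits of the packed data are transferred.
    """
    return packed & MASK32


def move_int32_to_packed(value: int) -> int:
    """Model a move from a 32-bit integer register to a 64-bit packed register.

    Assumption: the low thirty-two bits are loaded and the high order
    thirty-two bits of the packed register are zeroed.
    """
    return value & MASK32  # the high half is zero by construction


if __name__ == "__main__":
    print(hex(move_packed_to_int32(0x1122334455667788)))
    print(hex(move_int32_to_packed(0xDEADBEEF)))
```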

Pack Operation

In one embodiment of the present invention, the SRC1 602 register contains data (Source1), the SRC2 603 register contains data (Source2), and the DEST 605 register will contain the result data (Result) of the operation. That is, parts of Source1 and parts of Source2 will be packed together to generate Result.

In one embodiment, a pack operation converts packed words (or doublewords) into packed bytes (or words) by packing the low order bytes (or words) of the source packed words (or doublewords) into the bytes (or words) of the Result. In one embodiment, the pack operation converts quad packed words into packed doublewords. This operation can be optionally performed with signed data. Further, this operation can be optionally performed with saturate.

FIG. 7 illustrates one embodiment of a method of performing a pack operation on packed data. This embodiment can be implemented in the processor 109 of FIG. 2.

At step 701, decoder 202 decodes control signal 207 received by processor 109. Thus, decoder 202 decodes: the operation code for the appropriate pack operation; SRC1 602, SRC2 603 and DEST 605 addresses in registers 209; saturate/unsaturate, signed/unsigned, and length of the data elements in the packed data. As mentioned previously, SRC1 602 (or SRC2 603) can be used as DEST 605.

At step 702, via internal bus 205, decoder 202 accesses registers 209 in register file 204 given the SRC1 602 and SRC2 603 addresses. Registers 209 provides functional unit 203 with the packed data stored in the SRC1 602 register (Source1), and the packed data stored in the SRC2 603 register (Source2). That is, registers 209 communicate the packed data to functional unit 203 via internal bus 205.

At step 703, decoder 202 enables functional unit 203 to perform the appropriate pack operation. Decoder 202 further communicates, via internal bus 205, saturate and the size of the data elements in Source1 and Source2. Saturate is optionally used to clamp the value of the data in the result data element. If the value of a data element in Source1 or Source2 is greater than or less than the range of values that the data elements of Result can represent, then the corresponding result data element is set to its highest or lowest value. For example, if signed values in the word data elements of Source1 and Source2 are smaller than 0x80 (or 0x8000 for doublewords), then the result byte (or word) data elements are clamped to 0x80 (or 0x8000 for doublewords). If signed values in word data elements of Source1 and Source2 are greater than 0x7F (or 0x7FFF for doublewords), then the result byte (or word) data elements are clamped to 0x7F (or 0x7FFF).
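The clamping behavior described above can be illustrated with a short software sketch. The function name and its parameters are hypothetical and are not part of the disclosure; the source value is assumed to be given as an arithmetic integer rather than as a bit pattern.

```python
def saturate(value: int, out_bits: int, signed: bool = True) -> int:
    """Clamp `value` to the range of an `out_bits`-wide result element.

    Returns the result as an unsigned bit pattern, so that a signed
    minimum of -128 in an 8-bit element is reported as 0x80.
    """
    if signed:
        lowest = -(1 << (out_bits - 1))       # e.g. -128, pattern 0x80
        highest = (1 << (out_bits - 1)) - 1   # e.g. +127, pattern 0x7F
    else:
        lowest = 0
        highest = (1 << out_bits) - 1         # e.g. 255, pattern 0xFF
    clamped = max(lowest, min(highest, value))
    return clamped & ((1 << out_bits) - 1)


if __name__ == "__main__":
    print(hex(saturate(300, 8)))       # above the signed byte range
    print(hex(saturate(-300, 8)))      # below the signed byte range
    print(hex(saturate(40000, 16)))    # above the signed word range
```

Values already inside the representable range pass through unchanged, which matches the case in which no clamping is needed.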

At step 710, the size of the data element determines which step is to be executed next. If the size of the data elements is sixteen bits (packed word 502 data), then functional unit 203 performs step 712. However, if the size of the data elements in the packed data is thirty-two bits (packed doubleword 503 data), then functional unit 203 performs step 714.

Assuming the size of the source data elements is sixteen bits, then step 712 is executed. In step 712, the following is performed. Source1 bits seven through zero are Result bits seven through zero. Source1 bits twenty-three through sixteen are Result bits fifteen through eight. Source1 bits thirty-nine through thirty-two are Result bits twenty-three through sixteen. Source1 bits fifty-five through forty-eight are Result bits thirty-one through twenty-four. Source2 bits seven through zero are Result bits thirty-nine through thirty-two. Source2 bits twenty-three through sixteen are Result bits forty-seven through forty. Source2 bits thirty-nine through thirty-two are Result bits fifty-five through forty-eight. Source2 bits fifty-five through forty-eight are Result bits sixty-three through fifty-six. If saturate is set, then the high order bits of each word are tested to determine whether the Result data element should be clamped.

Assuming the size of the source data elements is thirty-two bits, then step 714 is executed. In step 714, the following is performed. Source1 bits fifteen through zero are Result bits fifteen through zero. Source1 bits forty-seven through thirty-two are Result bits thirty-one through sixteen. Source2 bits fifteen through zero are Result bits forty-seven through thirty-two. Source2 bits forty-seven through thirty-two are Result bits sixty-three through forty-eight. If saturate is set, then the high order bits of each doubleword are tested to determine whether the Result data element should be clamped.

In one embodiment, the packing of step 712 is performed simultaneously. However, in another embodiment, this packing is performed serially. In another embodiment, some of the packing is performed simultaneously and some is performed serially. This discussion also applies to the packing of step 714.

At step 720, the Result is stored in the DEST 605 register.
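The bit routing of the pack method, without saturation, can be sketched as follows. This is a software model under stated assumptions, not the disclosed circuit: the function name `pack_truncate` is hypothetical, each sixty-four bit register is modeled as a Python integer, and the elements are processed serially in a loop although the hardware may process them simultaneously.

```python
def pack_truncate(source1: int, source2: int, src_bits: int) -> int:
    """Pack the low half of each source element into a 64-bit Result.

    src_bits is the source element size: 16 (words to bytes) or
    32 (doublewords to words). Source1 elements fill the low half of
    Result and Source2 elements fill the high half. No saturation.
    """
    if src_bits not in (16, 32):
        raise ValueError("source elements must be 16 or 32 bits")
    out_bits = src_bits // 2
    count = 64 // src_bits               # elements per source register
    out_mask = (1 << out_bits) - 1
    result = 0
    for s, source in enumerate((source1, source2)):
        for i in range(count):
            # Keep only the low order half of source element i.
            low = (source >> (i * src_bits)) & out_mask
            # Source2 elements are placed after all Source1 elements.
            result |= low << ((s * count + i) * out_bits)
    return result


if __name__ == "__main__":
    print(hex(pack_truncate(0x0004000300020001, 0x0008000700060005, 16)))
```

Calling `pack_truncate` with a source element size of 32 gives the doubleword-to-word routing of the second case in the same way.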

Table 2 illustrates the in-register representation of a pack unsigned word operation with no saturation. The first row of bits is the packed data representation of Source1. The second row of bits is the data representation of Source2. The third row of bits is the packed data representation of the Result. The number below each data element bit is the data element number. For example, Source1 data element three is 10000000₂.

TABLE 2

Table 3 illustrates the in-register representation of a pack signed doubleword operation with saturation.

TABLE 3

Pack Circuits

In one embodiment of the present invention, to achieve efficient execution of the pack operation, parallelism is used. FIGS. 8 a and 8 b illustrate one embodiment of a circuit that can perform a pack operation on packed data. The circuit can optionally perform the pack operation with saturation.

The circuit of FIGS. 8 a and 8 b includes an operation control circuit 800, a result register 852, a result register 853, eight sixteen bit to eight bit test saturate circuits, and four thirty-two bit to sixteen bit test saturate circuits.

Operation control 800 receives information from the decoder 202 to enable a pack operation. Operation control 800 uses the saturate value to enable the saturation tests for each of the test saturate circuits. If the size of the source packed data is word packed data 503, then output enable 831 is set by operation control 800. This enables the output of result register 852. If the size of the source packed data is doubleword packed data 504, then output enable 832 is set by operation control 800. This enables the output of result register 853.

Each test saturate circuit can selectively test for saturation. If a test for saturation is disabled, then each test saturate circuit merely passes the low order bits through to a corresponding position in a result register. If a test for saturate is enabled, then each test saturate circuit tests the high order bits to determine if the result should be clamped.

Test saturate 810 through test saturate 817 have sixteen bit inputs and eight bit outputs. The eight bit outputs are the lower eight bits of the inputs, or optionally, are a clamped value (0x80, 0x7F, or 0xFF). Test saturate 810 receives Source1 bits fifteen through zero and outputs bits seven through zero for result register 852. Test saturate 811 receives Source1 bits thirty-one through sixteen and outputs bits fifteen through eight for result register 852. Test saturate 812 receives Source1 bits forty-seven through thirty-two and outputs bits twenty-three through sixteen for result register 852. Test saturate 813 receives Source1 bits sixty-three through forty-eight and outputs bits thirty-one through twenty-four for result register 852. Test saturate 814 receives Source2 bits fifteen through zero and outputs bits thirty-nine through thirty-two for result register 852. Test saturate 815 receives Source2 bits thirty-one through sixteen and outputs bits forty-seven through forty for result register 852. Test saturate 816 receives Source2 bits forty-seven through thirty-two and outputs bits fifty-five through forty-eight for result register 852. Test saturate 817 receives Source2 bits sixty-three through forty-eight and outputs bits sixty-three through fifty-six for result register 852.

Test saturate 820 through test saturate 823 have thirty-two bit inputs and sixteen bit outputs. The sixteen bit outputs are the lower sixteen bits of the inputs, or optionally, are a clamped value (0x8000, 0x7FFF, or 0xFFFF). Test saturate 820 receives Source1 bits thirty-one through zero and outputs bits fifteen through zero for result register 853. Test saturate 821 receives Source1 bits sixty-three through thirty-two and outputs bits thirty-one through sixteen for result register 853. Test saturate 822 receives Source2 bits thirty-one through zero and outputs bits forty-seven through thirty-two for result register 853. Test saturate 823 receives Source2 bits sixty-three through thirty-two and outputs bits sixty-three through forty-eight for result register 853.

For example, in Table 4, a pack word unsigned with no saturate is performed. Operation control 800 will enable result register 852 to output result[63:0] 860.

TABLE 4

However, if a pack doubleword unsigned with no saturate is performed, operation control 800 will enable result register 853 to output result[63:0] 860. Table 5 illustrates this result.

TABLE 5

Unpack Operation

In one embodiment, an unpack operation interleaves the low order packed bytes, words or doublewords of two source packed data to generate result packed bytes, words, or doublewords.

FIG. 9 illustrates one embodiment of a method of performing an unpack operation on packed data. This embodiment can be implemented in the processor 109 of FIG. 2.

Step 701 and step 702 are executed first. At step 903, decoder 202 enables functional unit 203 to perform the unpack operation. Decoder 202 communicates, via internal bus 205, the size of the data elements in Source1 and Source2.

At step 910, the size of the data element determines which step is to be executed next. If the size of the data elements is eight bits (packed byte 501 data), then functional unit 203 performs step 712. If the size of the data elements in the packed data is sixteen bits (packed word 502 data), then functional unit 203 performs step 714. If the size of the data elements in the packed data is thirty-two bits (packed doubleword 503 data), then functional unit 203 performs step 716.

Assuming the size of the source data elements is eight bits, then step 712 is executed. In step 712, the following is performed. Source1 bits seven through zero are Result bits seven through zero. Source2 bits seven through zero are Result bits fifteen through eight. Source1 bits fifteen through eight are Result bits twenty-three through sixteen. Source2 bits fifteen through eight are Result bits thirty-one through twenty-four. Source1 bits twenty-three through sixteen are Result bits thirty-nine through thirty-two. Source2 bits twenty-three through sixteen are Result bits forty-seven through forty. Source1 bits thirty-one through twenty-four are Result bits fifty-five through forty-eight. Source2 bits thirty-one through twenty-four are Result bits sixty-three through fifty-six.

Assuming the size of the source data elements is sixteen bits, then step 714 is executed. In step 714, the following is performed. Source1 bits fifteen through zero are Result bits fifteen through zero. Source2 bits fifteen through zero are Result bits thirty-one through sixteen. Source1 bits thirty-one through sixteen are Result bits forty-seven through thirty-two. Source2 bits thirty-one through sixteen are Result bits sixty-three through forty-eight.

Assuming the size of the source data elements is thirty-two bits, then step 716 is executed. In step 716, the following is performed. Source1 bits thirty-one through zero are Result bits thirty-one through zero. Source2 bits thirty-one through zero are Result bits sixty-three through thirty-two.

In one embodiment, the unpacking of step 712 is performed simultaneously. However, in another embodiment, this unpacking is performed serially. In another embodiment, some of the unpacking is performed simultaneously and some is performed serially. This discussion also applies to the unpacking of step 714 and step 716.

At step 720, the Result is stored in the DEST 605 register.
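The interleaving performed by the unpack method can be sketched in software as follows. As with any such model, this is an illustration under stated assumptions rather than the disclosed implementation: the function name `unpack_low` is hypothetical, registers are modeled as sixty-four bit Python integers, and the loop is serial although the hardware may operate on all elements simultaneously.

```python
def unpack_low(source1: int, source2: int, elem_bits: int) -> int:
    """Interleave the low order elements of two 64-bit sources.

    elem_bits is 8, 16 or 32. Element i of Source1 lands in Result
    element 2*i and element i of Source2 lands in Result element 2*i+1.
    Only the low thirty-two bits of each source contribute.
    """
    if elem_bits not in (8, 16, 32):
        raise ValueError("elements must be 8, 16 or 32 bits")
    mask = (1 << elem_bits) - 1
    count = 32 // elem_bits  # elements taken from each source
    result = 0
    for i in range(count):
        first = (source1 >> (i * elem_bits)) & mask
        second = (source2 >> (i * elem_bits)) & mask
        result |= first << (2 * i * elem_bits)
        result |= second << ((2 * i + 1) * elem_bits)
    return result


if __name__ == "__main__":
    # Bytes 00..03 of Source1 interleaved with bytes 10..13 of Source2.
    print(hex(unpack_low(0x1111111103020100, 0x2222222213121110, 8)))
```

The high order thirty-two bits of each source are ignored in this sketch, consistent with the bit assignments listed for the eight, sixteen and thirty-two bit cases.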

Table 6 illustrates the in-register representation of an unpack byte operation.

TABLE 6

Table 7 illustrates the in-register representation of an unpack word operation.

TABLE 7

Table 8 illustrates the in-register representation of an unpack doubleword operation.

TABLE 8

Unpack Circuits

In one embodiment of the present invention, to achieve efficient execution of the unpack operation, parallelism is used. FIG. 10 illustrates one embodiment of a circuit that can perform an unpack operation on packed data.

The circuit of FIG. 10 includes the operation control circuit 800, a result register 1052, a result register 1053, and a result register 1054.

Operation control 800 receives information from the decoder 202 to enable an unpack operation. If the size of the source packed data is byte packed data 502, then output enable 1032 is set by operation control 800. This enables the output of result register 1052. If the size of the source packed data is word packed data 503, then output enable 1033 is set by operation control 800. This enables the output of result register 1053. If the size of the source packed data is doubleword packed data 504, then output enable 1034 is set by operation control 800. This enables the output of result register 1054.

Result register 1052 has the following inputs. Source1 bits seven through zero are bits seven through zero for result register 1052. Source2 bits seven through zero are bits fifteen through eight for result register 1052. Source1 bits fifteen through eight are bits twenty-three through sixteen for result register 1052. Source2 bits fifteen through eight are bits thirty-one through twenty-four for result register 1052. Source1 bits twenty-three through sixteen are bits thirty-nine through thirty-two for result register 1052. Source2 bits twenty-three through sixteen are bits forty-seven through forty for result register 1052. Source1 bits thirty-one through twenty-four are bits fifty-five through forty-eight for result register 1052. Source2 bits thirty-one through twenty-four are bits sixty-three through fifty-six for result register 1052.

Result register 1053 has the following inputs. Source1 bits fifteen through zero are bits fifteen through zero for result register 1053. Source2 bits fifteen through zero are bits thirty-one through sixteen for result register 1053. Source1 bits thirty-one through sixteen are bits forty-seven through thirty-two for result register 1053. Source2 bits thirty-one through sixteen are bits sixty-three through forty-eight for result register 1053.

Result register 1054 has the following inputs. Source1 bits thirty-one through zero are bits thirty-one through zero for result register 1054. Source2 bits thirty-one through zero are bits sixty-three through thirty-two for result register 1054.

For example, in Table 9, an unpack word operation is performed. Operation control 800 will enable result register 1053 to output result[63:0] 860.

TABLE 9

However, if an unpack doubleword is performed, operation control 800 will enable result register 1054 to output result[63:0] 860. Table 10 illustrates this result.

TABLE 10

Therefore, the move, pack and unpack operations can manipulate multiple data elements. In prior art processors, multiple separate operations would be needed to achieve the effect of a single packed move, pack or unpack operation. The data lines for the packed data operations, in one embodiment, all carry relevant data. This leads to a higher performance computer system.

1-37. (canceled)
 38. A computing device comprising: a communication bus; a cache; a decoder operable to decode instructions specifying data manipulation operations; a register file comprising: a first set of registers operable to store 32-bit integer data; and a second set of registers operable to store a first packed data and a second packed data respectively including a first plurality of data elements and a second plurality of data elements; a functional unit coupled to the cache, the decoder, the register file, and the communication bus, and operable to execute decoded instructions specifying data manipulation operations, including: a first move instruction that, when executed by the functional unit, causes data to be transferred between a first packed data register and a second packed data register; a second move instruction that, when executed by the functional unit, causes data to be transferred between the first packed data register and a main memory; a third move instruction that, when executed by the functional unit, causes data to be transferred between the first packed data register and a 32-bit register; and an unpack instruction that, when executed by the functional unit, causes data elements from the first plurality of data elements and corresponding data elements from the second plurality of data elements to be interleaved into the register file as a third plurality of data elements in a third packed data, wherein each data element in said first plurality of data elements corresponds to a different data element in said second plurality of data elements, in a respective position; and a display device coupled to the communication bus and comprising graphics rendering devices.
 39. The computing device of claim 38, wherein a first register of the second set of registers is to hold the first packed data, a second register of the second set of registers is to hold the second packed data, and wherein the unpack instruction, when executed by the functional unit, causes data elements from the first plurality of data elements and corresponding data elements from the second plurality of data elements to be interleaved into the first register.
 40. The computing device of claim 38, wherein the first set of registers includes sixteen registers.
 41. The computing device of claim 40, wherein the first set of registers includes at least special registers and an instruction pointer register.
 42. The computing device of claim 38, wherein the second set of registers includes floating point registers independent from the first set of registers.
 43. The computing device of claim 38, wherein each of the second set of registers is to hold four 16 bit data elements, two 32-bit data elements, or one 64 bit data element.
 44. The computing device of claim 38, wherein each of the second set of registers is to hold eight 16 bit data elements, four 32-bit data elements, two 64 bit data elements, or one 128 bit data element.
 45. A processor comprising: a cache; a decoder to decode an unpack instruction; a register file including a first register to hold a first packed data including a first plurality of data elements and a second register to hold second packed data including a second plurality of data elements; and a functional unit coupled to the cache, the decoder, and the register file, the functional unit to interleave data elements from the first plurality of data elements and data elements from the second plurality of data elements in corresponding positions to form a third packed data.
 46. The processor of claim 45, wherein the first plurality of data elements is to include a first element to be held in a first position of the first register and a second element to be held in a second position of the first register, and wherein the second plurality of data elements is to include a third element to be held in the first position of the second register that is to correspond to the first position of the first register and a fourth element to be held in the second position of the second register that is to correspond to the second position of the first register.
 47. The processor of claim 46, wherein the functional unit to interleave data elements from the first plurality of data elements and data elements from the second plurality of data elements in corresponding positions to form a third packed data comprises the functional unit to store the third element to be held in the first position of the second register to the second position of the first register and to maintain the first element in the first position of the first register to form the third packed data to be held in the first register.
 48. The processor of claim 45, wherein the first plurality of data elements is to include a first element to be held in a first position of the first register, a second element to be held in a second position of the first register, a third element to be held in a third position of the first register, and a fourth element to be held in a fourth position of the first register, and wherein the second plurality of data elements is to include a fifth element to be held in the first position of the second register that is to correspond to the first position of the first register, a sixth element to be held in the second position of the second register that is to correspond to the second position of the first register, a seventh element to be held in the third position of the second register that is to correspond to the third position of the first register, and an eighth element to be held in the fourth position of the second register that is to correspond to the fourth position of the first register.
 49. The processor of claim 48, wherein the functional unit to interleave data elements from the first plurality of data elements and data elements from the second plurality of data elements in corresponding positions to form a third packed data comprises: the functional unit to: maintain the first element to be held in the first position of the first register at the first position of the first register, store the fifth data element to be held in the first position of the second register to the second position of the first register, maintain the third data element to be held in the third position of the first register in the third position of the first register, and store the seventh data element to be held in the third position of the second register in the fourth position of the first register to form the third packed data.
 50. A method comprising: decoding an unpack instruction; storing a first packed data including a first plurality of data elements in a first register; storing a second packed data including a second plurality of data elements in a second register; in response to decoding the unpack instruction, interleaving the data elements from the first plurality of data elements and data elements from the second plurality of data elements in corresponding positions to form a third packed data.
 51. The method of claim 50, wherein the first plurality of data elements is to include a first element to be held in a first position of the first register and a second element to be held in a second position of the first register, and wherein the second plurality of data elements is to include a third element to be held in the first position of the second register that is to correspond to the first position of the first register and a fourth element to be held in the second position of the second register that is to correspond to the second position of the first register.
 52. The method of claim 51, wherein interleaving data elements from the first plurality of data elements and data elements from the second plurality of data elements in corresponding positions to form a third packed data comprises storing the third element to be held in the first position of the second register to the second position of the first register and maintaining the first element in the first position of the first register to form the third packed data to be held in the first register.
 53. The method of claim 50, wherein the first plurality of data elements is to include a first element to be held in a first position of the first register, a second element to be held in a second position of the first register, a third element to be held in a third position of the first register, and a fourth element to be held in a fourth position of the first register, and wherein the second plurality of data elements is to include a fifth element to be held in the first position of the second register that is to correspond to the first position of the first register, a sixth element to be held in the second position of the second register that is to correspond to the second position of the first register, a seventh element to be held in the third position of the second register that is to correspond to the third position of the first register, and an eighth element to be held in the fourth position of the second register that is to correspond to the fourth position of the first register.
 54. The method of claim 50, wherein interleaving data elements from the first plurality of data elements and data elements from the second plurality of data elements in corresponding positions to form a third packed data comprises: maintaining the first element to be held in the first position of the first register at the first position of the first register, storing the fifth data element to be held in the first position of the second register to the second position of the first register, maintaining the third data element to be held in the third position of the first register in the third position of the first register, and storing the seventh data element to be held in the third position of the second register in the fourth position of the first register to form the third packed data.
 55. A computer readable medium including code having an unpack instruction, the code, when executed, to cause a machine to perform the operations of: decode an unpack instruction; store a first packed data including a first plurality of data elements in a first register; store a second packed data including a second plurality of data elements in a second register; in response to decoding the unpack instruction, interleave the data elements from the first plurality of data elements and data elements from the second plurality of data elements in corresponding positions to form a third packed data.
 56. The computer readable medium of claim 55, wherein the first plurality of data elements is to include a first element to be held in a first position of the first register and a second element to be held in a second position of the first register, and wherein the second plurality of data elements is to include a third element to be held in the first position of the second register that is to correspond to the first position of the first register and a fourth element to be held in the second position of the second register that is to correspond to the second position of the first register.
 57. The computer readable medium of claim 55, wherein interleaving data elements from the first plurality of data elements and data elements from the second plurality of data elements in corresponding positions to form a third packed data comprises storing the third element to be held in the first position of the second register to the second position of the first register and maintaining the first element in the first position of the first register to form the third packed data to be held in the first register. 